Lab 5

Author

PSTAT 134/234

Lab 5

This week’s lab will cover the basics (and a bit beyond) of visualizing data using both R’s ggplot2 and Python’s seaborn. We’ll use a few different datasets in the process.

Note that not every plot in this lab is an example of the gold standards for data visualization. The goal of this lab is to give you some basic code that you can modify, combine, and select from, using your understanding of the principles of data visualization, to ultimately create effective visuals.

Exercises are provided at the end of this lab, in their own separate section, rather than throughout. You can do the exercises, as always, in R, Python, or a combination of both. Code folding was implemented for this lab to reduce length. To reveal the code that generated a specific plot, simply click the “Code” drop-down button.

Setup

The following two code chunks set up the libraries we’ll use for this lab in R and then in Python. Note that you’ll need to make sure you have all these libraries installed to render this lab.

Code
options(scipen=999)  # turn off scientific notation like 1e+48
library(ggplot2)
library(tidyverse)
library(ggcorrplot)
library(ggalt)
library(palmerpenguins)
library(ggExtra)
theme_set(theme_bw())  # preset the bw theme.
Code
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

Data

The data sets we’ll use for this lab are:

  1. mpg: Comes from ggplot2. Contains fuel economy data from 1999 to 2008 for 38 popular models of cars. 234 observations and 11 variables.

  2. midwest: Comes from ggplot2. Contains demographic information for midwest counties in the US from the 2000 census. 437 observations and 28 variables.

  3. mtcars: Comes from datasets. Contains information about fuel consumption, automobile design and performance for 32 automobiles (1973-74 models).

  4. penguins: Comes from palmerpenguins. Contains size measurements for adult foraging penguins near Palmer Station, Antarctica. 344 observations and 8 variables.

One Continuous Variable

The following section gives an overview of the plots that you might use to assess the distribution of a single continuous variable. They can also be extended; for example, you can create one histogram per level of a categorical variable.

Histogram

We’ll create a histogram of engine displacement from the mpg dataset, first in R using ggplot2 and then in Python using seaborn. We will specify the width of the bins, but not the number of bins, in R, which allows R to choose the number of bins automatically:

Code
gg <- ggplot(mpg, aes(x = displ)) +
  geom_histogram(binwidth = .1,
                 size = .1) +
  labs(title = "Histogram with Auto Binning",
       subtitle = "Engine Displacement") +
  xlab("Displacement") +
  ylab("Count")

gg

The distribution is positively skewed; the maximum engine displacement is around 7. The median is most likely between 3 and 4.

In Python, we specify the number of bins:

Code
mpg = r.mpg

plt.figure(figsize=(8, 6))
sns.histplot(mpg['displ'], bins=30, kde=False, edgecolor='black', binwidth=0.1)

plt.title("Histogram")
plt.suptitle("Engine Displacement")
plt.xlabel("Displacement")
plt.ylabel("Count")

plt.show()

We might then wonder if the distribution of engine displacement differs across drive trains – for example, if the distribution is different for four wheel drive cars than it is for rear wheel drive, etc. We could add drv as a categorical variable. Let’s do that (again, first in R, then in Python):

Code
gg <- ggplot(mpg, aes(x = displ, fill = factor(drv))) +
  geom_histogram(bins = 6,
                 size = .1,
                 position = "stack") +
  labs(title = "Histogram with Auto Binning",
       subtitle = "Engine Displacement across Vehicle Drive Trains") +
  xlab("Displacement") +
  ylab("Count") +
  scale_fill_discrete(name = "Drive Train")

gg

It’s a bit hard to tell when the histograms are on top of each other like this, but rear-wheel drive cars seem to have more engine displacement, while front-wheel drive cars seem to have less.

Code
plt.figure(figsize=(8, 6))
sns.histplot(data=mpg, x='displ', hue='drv', bins=6, edgecolor='black', stat='count', multiple='stack')

# Add labels and title
plt.title("Histogram")
plt.suptitle("Engine Displacement across Vehicle Drive Trains")
plt.xlabel("Displacement")
plt.ylabel("Count")

# Customize legend
plt.legend(title="Drive Train")

# Show the plot
plt.show()

Or if we would rather see three histograms next to each other in facets, which might make the distributions easier to compare (as opposed to stacking them):

Code
gg <- ggplot(mpg, aes(x = displ)) +
  geom_histogram(bins = 6,
                 size = .1,
                 position = "stack") +
  labs(title = "Histogram with Auto Binning",
       subtitle = "Engine Displacement across Vehicle Drive Trains") +
  xlab("Displacement") +
  ylab("Count") +
  facet_wrap(vars(drv))

gg

Density Plot

A density plot is an alternative smoothed version of a histogram that uses kernel density estimation to create a continuous curve that approximates the underlying probability distribution. Sometimes people add density curves on top of histograms.

Here we make a density plot of engine displacement broken down by drive trains. We also customize the thickness of the lines so they stand out a little more.

Code
gg <- ggplot(mpg, aes(x = displ, color = factor(drv))) +
  geom_density(linewidth = 0.75) +
  labs(title = "Density Curve",
       subtitle = "Engine Displacement across Vehicle Drive Trains") +
  xlab("Displacement") +
  ylab("Density") +
  scale_color_discrete(name = "Drive Train")

gg

Then we reproduce that plot in Python:

Code
plt.figure(figsize=(8, 6))
sns.kdeplot(data=mpg, x='displ', hue='drv', linewidth=0.75)

# Add labels and title
plt.title("Density Curve")
plt.suptitle("Engine Displacement across Vehicle Drive Trains")
plt.xlabel("Displacement")
plt.ylabel("Density")

# Show the plot
plt.show()

We could also map drv to both color and fill and use the alpha value to add transparency to the filled density curves. That aesthetic can take values between \(0\) (completely transparent) and \(1\) (completely opaque).

Code
gg <- ggplot(mpg, aes(x = displ, color = factor(drv),
                      fill = factor(drv))) +
  geom_density(linewidth = 0.75, alpha = 0.5) +
  labs(title = "Density Curve",
       subtitle = "Engine Displacement across Vehicle Drive Trains") +
  xlab("Displacement") +
  ylab("Density") +
  scale_color_discrete(name = "Drive Train") +
  scale_fill_discrete(name = "Drive Train")

gg

We also reproduce that plot in Python:

Code
plt.figure(figsize=(8, 6))
sns.kdeplot(data=mpg, x='displ', hue='drv', fill=True, linewidth=0.75, alpha=0.5)

# Add labels and title
plt.title("Density Curve")
plt.suptitle("Engine Displacement across Vehicle Drive Trains")
plt.xlabel("Displacement")
plt.ylabel("Density")

# Show the plot
plt.show()

Box Plot

Box plots, or box-and-whisker plots, are graphical representations of distributions. They summarize some key information by displaying the minimum, first quartile, median, third quartile, maximum, and outliers. The “box” is formed between the first and third quartiles with a line at the median; “whiskers” extend from the box to the minimum and maximum, with points added at outliers.

We’ll make a box plot of city mileage across vehicle classes in R. Note that we order the boxes by median. The argument varwidth allows the size of the boxes to vary based on how many observations are in that group.

Code
gg <- ggplot(mpg, aes(x = reorder(class, cty, FUN = median),
                      y = cty)) +
  geom_boxplot(varwidth = T) +
  labs(title = "Box Plot",
       subtitle = "City Mileage across Vehicle Classes") +
  xlab("Vehicle Class") +
  ylab("City Mileage")

gg

Pickups and SUVs have the least city mileage, while compacts and subcompacts have the most. There are fewer two-seater cars in the dataset, so their box is smallest.

Here is code to create similar box plots in Python (although without allowing the size to vary based on the number of observations in the group). Challenge: Figure out how to mimic the varwidth argument.

Code
order = mpg.groupby('class')['cty'].median().sort_values().index

# Create a box plot with reordered categories
plt.figure(figsize=(10, 6))
sns.boxplot(data=mpg, x='class', y='cty', order=order, width=0.5)

# Add labels and title
plt.title("Box Plot")
plt.suptitle("City Mileage across Vehicle Classes")
plt.xlabel("Vehicle Class")
plt.ylabel("City Mileage")

# Show the plot
plt.xticks(rotation=45)
([0, 1, 2, 3, 4, 5, 6], [Text(0, 0, 'pickup'), Text(1, 0, 'suv'), Text(2, 0, '2seater'), Text(3, 0, 'minivan'), Text(4, 0, 'midsize'), Text(5, 0, 'subcompact'), Text(6, 0, 'compact')])
Code
plt.show()

Here we add a third variable, number of cylinders, and map it to the color of the boxes:

Code
gg <- ggplot(mpg, aes(x = reorder(class, cty, FUN = median),
                      y = cty, fill = factor(cyl))) +
  geom_boxplot() +
  labs(title = "Box Plot",
       subtitle = "City Mileage across Vehicle Classes and Number of Cylinders") +
  xlab("Vehicle Class") +
  ylab("City Mileage") +
  scale_fill_discrete(name = "Number of Cylinders")

gg

Within classes of vehicle, cars with fewer cylinders tend to have more city mileage. Very few cars have five cylinders. Two-seaters tend to only have eight cylinders. Most minivans are six-cylinder.

And we do the same in Python:

Code
order = mpg.groupby('class')['cty'].median().sort_values().index

# Create a box plot with reordered categories and fill by cylinder count
plt.figure(figsize=(10, 6))
sns.boxplot(data=mpg, x='class', y='cty', order=order, hue='cyl')

# Add labels and title
plt.title("Box Plot")
plt.suptitle("City Mileage across Vehicle Classes and Number of Cylinders")
plt.xlabel("Vehicle Class")
plt.ylabel("City Mileage")

# Show the plot
plt.xticks(rotation=45)
([0, 1, 2, 3, 4, 5, 6], [Text(0, 0, 'pickup'), Text(1, 0, 'suv'), Text(2, 0, '2seater'), Text(3, 0, 'minivan'), Text(4, 0, 'midsize'), Text(5, 0, 'subcompact'), Text(6, 0, 'compact')])
Code
plt.legend(title="Cylinders")
plt.show()

Violin Plot

Violin plots are data visualizations that combine box plots with density plots by displaying the density mirrored along a central axis, which creates an unusual shape called a violin. The shape of the “violin” reflects the density of the distribution at different values, allowing you to see where the data are concentrated. Sometimes there is a smaller box plot presented inside the violin.

We’ll recreate the plot of city mileage by vehicle class, this time with violins instead of boxes:

Code
gg <- ggplot(mpg, aes(x = reorder(class, cty, FUN = median),
                      y = cty)) +
  geom_violin() +
  labs(title = "Violin Plot",
       subtitle = "City Mileage across Vehicle Classes") +
  xlab("Vehicle Class") +
  ylab("City Mileage")

gg

And we’ll do the same in Python. Notice that Python adds a box within the violins by default:

Code
order = mpg.groupby('class')['cty'].median().sort_values().index

# Create a violin plot with reordered categories
plt.figure(figsize=(10, 6))
sns.violinplot(data=mpg, x='class', y='cty', order=order)

# Add labels and title
plt.title("Violin Plot")
plt.suptitle("City Mileage across Vehicle Classes")
plt.xlabel("Vehicle Class")
plt.ylabel("City Mileage")

# Show the plot
plt.xticks(rotation=45)
([0, 1, 2, 3, 4, 5, 6], [Text(0, 0, 'pickup'), Text(1, 0, 'suv'), Text(2, 0, '2seater'), Text(3, 0, 'minivan'), Text(4, 0, 'midsize'), Text(5, 0, 'subcompact'), Text(6, 0, 'compact')])
Code
plt.show()

Dot Plot

A dot plot is a simple graphical representation of data where the individual data points are represented by dots. This makes it fairly easy to see the distribution of the data. They can be oriented horizontally or vertically.

We’ll recreate the plot of city mileage by vehicle class, this time as a dot plot:

Code
gg <- ggplot(mpg, aes(x = reorder(class, cty, FUN = median),
                      y = cty)) +
  geom_dotplot(binaxis = 'y',
               stackdir = 'center',
               dotsize = .5) +
  labs(title = "Dot Plot",
       subtitle = "City Mileage across Vehicle Classes") +
  xlab("Vehicle Class") +
  ylab("City Mileage")

gg

And then again in Python:

Code
order = mpg.groupby('class')['cty'].median().sort_values().index

# Create a dot plot with reordered categories
plt.figure(figsize=(10, 6))
sns.stripplot(data=mpg, x='class', y='cty', order=order, jitter=True, color='black')

# Add labels and title
plt.title("Dot Plot")
plt.suptitle("City Mileage across Vehicle Classes")
plt.xlabel("Vehicle Class")
plt.ylabel("City Mileage")

# Show the plot
plt.xticks(rotation=45)
([0, 1, 2, 3, 4, 5, 6], [Text(0, 0, 'pickup'), Text(1, 0, 'suv'), Text(2, 0, '2seater'), Text(3, 0, 'minivan'), Text(4, 0, 'midsize'), Text(5, 0, 'subcompact'), Text(6, 0, 'compact')])
Code
plt.show()

One Categorical Variable

The following section gives an overview of the plots that you might use to assess the distribution of a single categorical variable. They can also be extended.

Bar Chart

A bar chart is a graphical representation of the distribution of a categorical variable, using rectangular bars to show the frequency or value of different categories. They can be displayed either vertically or horizontally. They are effective tools for comparing the values or counts of different categories.

Here, we make a bar chart displaying the most common car manufacturers in the mpg data. Note that we order the bars by count, which is a common technique.

Code
gg <- mpg %>% 
  ggplot(aes(y = fct_infreq(manufacturer))) + 
  geom_bar() +
  labs(title = "Bar Chart",
       subtitle = "Manufacturer of vehicles") +
  xlab("Count") +
  ylab("Manufacturer")

gg

Most cars are manufactured by Dodge, then Toyota, then Volkswagen, then Ford.

And the same in Python:

Code
manufacturer_counts = mpg['manufacturer'].value_counts().reset_index()
manufacturer_counts.columns = ['manufacturer', 'count']

# Create a bar chart
plt.figure(figsize=(10, 6))
sns.barplot(data=manufacturer_counts, y='manufacturer', x='count', order=manufacturer_counts['manufacturer'])

# Add labels and title
plt.title("Bar Chart")
plt.suptitle("Manufacturer of Vehicles")
plt.xlabel("Count")
plt.ylabel("Manufacturer")

# Show the plot
plt.show()

Proportion stacked bar charts are a variation of stacked bar charts that display the relative proportions of different categories within a whole. Each bar represents a specific category, and segments of the bar show the contributions of subcategories. The height (or length) of each segment reflects the proportion of that subcategory, rather than the count. All bars in the chart typically have the same height, making it easy to compare the distribution of subcategories.

Here we make this plot to compare the classes of vehicle across drive trains. This will tell us which vehicle classes more commonly feature which drive trains.

Code
gg <- mpg %>% 
  ggplot(aes(x = factor(drv))) + 
  geom_bar(aes(fill = factor(class)),
           position = "fill") +
  labs(title = "Proportion Stacked Bar Chart",
       subtitle = "Classes of vehicle by drive train") +
  xlab("Drive Train") +
  ylab("Proportion") +
  scale_fill_discrete(name = "Class")

gg

As you can see, the plot makes it easy to tell that two-seater cars are always rear-wheel drive, while the majority of front-wheel drive cars are midsize or compact, and the majority of four-wheel drive cars are SUVs or pickups. No SUVs or pickup trucks are front-wheel drive.

We can reproduce this plot in Python:

Code
mpg = r.mpg
# Calculate counts of each class by drive train
count_data = mpg.groupby(['drv', 'class']).size().reset_index(name='count')

# Calculate the total counts for each drive train
total_counts = count_data.groupby('drv')['count'].transform('sum')

# Calculate the proportion
count_data['proportion'] = count_data['count'] / total_counts

# Create a pivot table for better plotting
pivot_data = count_data.pivot(index='drv', columns='class', values='proportion').fillna(0)

# Plotting
plt.figure(figsize=(10, 6))

# Stack the bars
pivot_data.plot(kind='bar', stacked=True, ax=plt.gca(), alpha=0.8)

# Set y-axis limits
plt.ylim(0, 1)
(0.0, 1.0)
Code
# Add labels and title
plt.title("Proportion Stacked Bar Chart")
plt.suptitle("Classes of Vehicle by Drive Train")
plt.xlabel("Drive Train")
plt.ylabel("Proportion")

# Customize the legend
plt.legend(title="Class", bbox_to_anchor=(1.05, 1), loc='upper left')

# Show the plot
plt.show()

Notice that the default colors are different!

Lollipop Chart

Code
gg <- mpg %>% 
  group_by(class) %>% 
  summarize(cty_m = mean(cty)) %>% 
  ggplot(aes(x = reorder(class, cty_m), y = cty_m)) + 
  geom_point(size = 3) +
  geom_segment(aes(x = class, xend = class, y = 0,
                   yend = cty_m)) +
  labs(title = "Lollipop Chart",
       subtitle = "City mileage by vehicle class") +
  xlab("Vehicle Class") +
  ylab("Mean City Mileage")

gg

Code
cty_m = mpg.groupby('class')['cty'].mean().reset_index(name='cty_m').sort_values('cty_m')

# Create a lollipop chart
plt.figure(figsize=(10, 6))

# Plot the segments
for index, row in cty_m.iterrows():
    plt.plot([row['class'], row['class']], [0, row['cty_m']], color='gray', linestyle='--')

# Plot the points
sns.scatterplot(data=cty_m, x='class', y='cty_m', color='blue', s=100)

# Add labels and title
plt.title("Lollipop Chart")
plt.suptitle("City Mileage by Vehicle Class")
plt.xlabel("Vehicle Class")
plt.ylabel("Mean City Mileage")

# Rotate x-axis labels for better visibility
plt.xticks(rotation=45)
([0, 1, 2, 3, 4, 5, 6], [Text(0, 0, 'pickup'), Text(1, 0, 'suv'), Text(2, 0, '2seater'), Text(3, 0, 'minivan'), Text(4, 0, 'midsize'), Text(5, 0, 'compact'), Text(6, 0, 'subcompact')])
Code
# Show the plot
plt.show()

Dot Plot

Code
gg <- mpg %>% 
  group_by(class) %>% 
  summarize(cty_m = mean(cty)) %>% 
  ggplot(aes(y = reorder(class, cty_m), x = cty_m)) + 
  geom_point(size = 3) +
  geom_segment(aes(x = min(cty_m), xend = max(cty_m), y = class,
                   yend = class), linetype = "dashed",
               size = 0.1) +
  labs(title = "Dot Plot",
       subtitle = "City mileage by vehicle class") +
  xlab("Mean City Mileage") +
  ylab("Vehicle Class")

gg

Code
cty_m = mpg.groupby('class')['cty'].mean().reset_index(name='cty_m').sort_values('cty_m')

# Create a categorical variable for sorting
cty_m['class'] = pd.Categorical(cty_m['class'], categories=cty_m['class'], ordered=True)

# Create a dot plot
plt.figure(figsize=(10, 6))

# Plot the points
sns.scatterplot(data=cty_m, y='class', x='cty_m', color='blue', s=100)

# Add horizontal dashed lines for the mean city mileage
plt.hlines(y=cty_m['class'], xmin=cty_m['cty_m'].min(), xmax=cty_m['cty_m'].max(),
            color='gray', linestyle='--', linewidth=0.5)

# Add labels and title
plt.title("Dot Plot")
plt.suptitle("City Mileage by Vehicle Class")
plt.xlabel("Mean City Mileage")
plt.ylabel("Vehicle Class")

# Show the plot
plt.show()

Two or More Continuous Variables

The plots in this section are generally used to assess the relationship, or correlation, between two or more continuous variables.

Scatterplot

Probably the most frequently used plot for data analysis – the scatterplot. The basic elements of a scatterplot are simply (1) a continuous variable on the x-axis and (2) a continuous variable on the y-axis. Some scatterplots may add a line representing a function of the y-axis variable on the x-axis variable (often linear regression, but sometimes a smooth curve). Some may also add confidence interval bars to that line.

This plot uses data from US counties in the midwest to visualize the relationship between population (poptotal) on the y-axis and area on the x-axis.

Code
data("midwest", package = "ggplot2")

gg <- ggplot(midwest, aes(x = area, y = poptotal)) + 
  geom_point() + 
  geom_smooth(method = "loess", se = F) + 
  xlim(c(0, 0.1)) + 
  ylim(c(0, 500000)) + 
  labs(subtitle = "Area vs. Population", 
       y = "Population", 
       x = "Area", 
       title = "Scatterplot", 
       caption = "Source: midwest")

plot(gg)

Code
midwest = r.midwest

plt.figure(figsize=(10, 6))

# Scatter plot
sns.scatterplot(data=midwest, x='area', y='poptotal')

# LOESS smoothing line
sns.regplot(data=midwest, x='area', y='poptotal', lowess=True, scatter=False, color='orange')

# Set limits for x and y axes
plt.xlim(0, 0.1)
(0.0, 0.1)
Code
plt.ylim(0, 500000)
(0.0, 500000.0)
Code
# Add labels and title
plt.title("Scatterplot")
plt.suptitle("Area vs. Population")
plt.xlabel("Area")
plt.ylabel("Population")
plt.figtext(0.5, 0.01, "Source: midwest", ha="center", fontsize=10)

# Show the plot
plt.show()

If you want to flag certain observations to draw attention to them in particular, you can use geom_encircle() from the package ggalt. Within that function, specify a new data frame that contains only the points of interest. You can use the size argument to control the thickness of the line being drawn and expand if you want the line to pass outside the points.

Here we draw a circle around those points in the top region of the plot, with a total population above \(350,000\):

Code
midwest_selection <- midwest %>% 
  filter(poptotal > 350000)

gg <- ggplot(midwest, aes(x = area, y = poptotal)) + 
  geom_point() +  
  geom_smooth(method = "loess", se = F) + 
  xlim(c(0, 0.1)) + 
  ylim(c(0, 500000)) +
  geom_encircle(aes(x = area, y = poptotal), 
                data = midwest_selection, 
                color = "red", 
                size = 2, 
                expand = 0.08) +   # encircle
  labs(subtitle = "Area vs. Population", 
       y = "Population", 
       x = "Area", 
       title = "Scatterplot + Encircle", 
       caption = "Source: midwest")
plot(gg)

Integer Data

For this example, rather than making a scatterplot of the relationship between population and area, let’s look at a different dataset and make a scatterplot of the variables cty (city mileage) and hwy (highway mileage) for different cars.

Code
data(mpg, package="ggplot2")

gg <- ggplot(mpg, aes(cty, hwy)) + 
  geom_point() + 
  labs(subtitle = "City vs. Highway Mileage", 
       y = "Highway Mileage", 
       x = "City Mileage", 
       title = "Scatterplot with overlapping points", 
       caption = "Source: mpg")

plot(gg)

This is a basic scatterplot. Notice something about it? Well, although there are technically 234 observations in the mpg dataset, this plot appears to show a much smaller number. This is because cty and hwy are integers:

Code
mpg %>% 
  select(cty, hwy) %>% 
  head()
# A tibble: 6 × 2
    cty   hwy
  <int> <int>
1    18    29
2    21    29
3    20    31
4    21    30
5    16    26
6    18    26

So if we treat them as continuous random variables, because the observations can only take on integer values (\(18\) and not, for example, \(18.02\)), there are many overlapping points plotted. This gives the scatterplot almost a grid-like appearance.

Jitter

One way we can deal with this is by adding some random variation, or “jitter,” to the points. The width argument controls how much variation is added; if you increase the width, the points will be more spread out.

Code
gg <- ggplot(mpg, aes(cty, hwy)) + 
  geom_jitter(width = .75) + 
  labs(subtitle = "City vs. Highway Mileage", 
       y = "Highway Mileage", 
       x = "City Mileage", 
       title = "Scatterplot with jitter", 
       caption = "Source: mpg")

plot(gg)

Count

The other way we can handle this is by creating a “counts” plot. We can use geom_count, which will increase the size of the circle based on the number of overlapping points. (Note that we use show.legend=F to suppress the legend here.)

Code
gg <- ggplot(mpg, aes(cty, hwy)) + 
  geom_count(show.legend = F) + 
  labs(subtitle = "City vs. Highway Mileage", 
       y = "Highway Mileage", 
       x = "City Mileage", 
       title = "Counts plot", 
       caption = "Source: mpg")

plot(gg)

Marginal Distributions

Another variation of the scatterplot involves adding visualizations of the marginal distributions of the X and Y variables. This can be via histograms, boxplots, or density curves. Here is a version of our initial scatterplot, area vs. population, with marginal histograms added:

Code
gg <- ggplot(midwest, aes(x = area, y = poptotal)) + 
  geom_point() + 
  geom_smooth(method = "loess", se = F) + 
  xlim(c(0, 0.1)) + 
  ylim(c(0, 500000)) + 
  labs(subtitle = "Area vs. Population", 
       y = "Population", 
       x = "Area", 
       title = "Scatterplot with Marginal Histograms", 
       caption = "Source: midwest")

ggMarginal(gg, type = "histogram")

Or with marginal boxplots:

Code
gg <- ggplot(midwest, aes(x = area, y = poptotal)) + 
  geom_point() + 
  geom_smooth(method = "loess", se = F) + 
  xlim(c(0, 0.1)) + 
  ylim(c(0, 500000)) + 
  labs(subtitle = "Area vs. Population", 
       y = "Population", 
       x = "Area", 
       title = "Scatterplot with Marginal Boxplots", 
       caption = "Source: midwest")

ggMarginal(gg, type = "boxplot")

These types of plots are useful when you want to visualize both the bivariate and univariate relationships at the same time.

Correlogram

A correlogram is a visualization used to display the correlation between multiple continuous variables in a dataset. It’s typically presented as a color-coded matrix, where the colors indicate the strength and direction of the relationship, varying from strong negative (cool colors, typically) to strong positive (warm colors typically). The diagonal of the matrix is usually the correlation of each variable with itself (which is always \(1\)). Often people only display either the upper or lower triangle of the matrix; the other triangle has redundant information, and presenting too much information at once can overwhelm the viewer.

For this we’ll introduce the mtcars data. We would continue using mpg, but that dataset has a lot of categorical variables, whereas mtcars has a variety of continuous ones.

Code
data("mtcars")
corr <- round(cor(mtcars), 1)

ggcorrplot(corr,
           type = "lower",
           lab = TRUE,
           lab_size = 3,
           method = "circle",
           title = "Correlogram of mtcars data")

Exercises

For this lab, you will create some visualizations using the penguins data. The following code displays the first few rows of the data in R.

Code
penguins %>% 
  head()
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>
  1. Create a histogram of penguin body mass in grams.

  2. Create a density curve of penguin body mass in grams.

  3. Create box plots of body mass in grams by penguin species.

  4. Create density curves of body mass in grams by penguin species.

  5. Create a percent stacked bar chart displaying the species of penguins on each island.

  6. Create a scatterplot with body mass on the y-axis and flipper length on the x-axis. Color-code the points by penguin species. Add marginal histograms.

  7. Create a correlogram of the continuous variables in the penguin dataset.